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ABSTRACT 

Modern problems in astronomical Bayesian inference require efficient methods for sam¬ 
pling from complex, high-dimensional, often multi-modal probability distributions. 
Most popular methods, such as Markov chain Monte Carlo sampling, perform poorly 
on strongly multi-modal probability distributions, rarely jumping between modes or 
settling on just one mode without finding others. Parallel tempering addresses this 
problem by sampling simultaneously with separate Markov chains from tempered ver¬ 
sions of the target distribution with reduced contrast levels. Gaps between modes can 
be traversed at higher temperatures, while individual modes can be efficiently explored 
at lower temperatures. In this paper, we investigate how one might choose the ladder 
of temperatures to achieve more efficient sampling, as measured by the autocorrelation 
time of the sampler. In particular, we present a simple, easily-implemented algorithm 
for dynamically adapting the temperature configuration of a sampler while sampling. 
This algorithm dynamically adjusts the temperature spacing to achieve a uniform rate 
of exchanges between chains at neighbouring temperatures. We compare the algorithm 
to conventional geometric temperature configurations on a number of test distribu¬ 
tions and on an astrophysical inference problem, reporting efficiency gains by a factor 
of 1.2-2.5 over a well-chosen geometric temperature configuration and by a factor of 
1.5-5 over a poorly chosen configuration. On all of these problems a sampler using 
the dynamical adaptations to achieve uniform acceptance ratios between neighbouring 
chains outperforms one that does not. 


1 INTRODUCTION 

Many problems in astronomical data analysis and Bayesian 
statistical inference demand the characterisation of high¬ 
dimensional probability distributions with complicated 
structures. Lacking analytic forms, these distributions must 
be explored numerically, usually via Monte Carlo methods. 

Parallel tempered Markov chain Monte Carlo (MCMC), 
a development on standard MCMC, uses several Markov 
chains in parallel to explore a target distribution at differ¬ 
ent “temperatures” (Earl & Deem 2005; Swendsen & Wang 
1986; Geyer 1991). As the temperature increases, the poste¬ 
rior distribution asymptotes to the prior, allowing a chain to 
efficiently explore the whole prior volume without becoming 
stuck in regions of the parameter space with high probability 
density. At lower temperatures, a chain can more efficiently 
sample from such a high-probability region. Meanwhile, ex¬ 
change of positions between chains allows colder chains to 
migrate between widely separated modes in the parameter 
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space (Geyer 1991). Parallel tempered MCMC samplers are 
thus particularly well-suited to sampling posterior distribu¬ 
tions with well-separated modes, where a regular MCMC 
sampler would take many iterations to find its way between 
modes. 

An open problem in the application of parallel temper¬ 
ing is selecting a specification, or ladder, of temperatures 
that minimises the autocorrelation time (ACT) of the chain 
sampling the posterior distribution of interest. The efficiency 
of a given ladder hinges critically on the rate at which it can 
transfer the positions in parameter space of samples between 
high and low temperatures. 

In this paper we present a simple algorithm that adapts 
the temperature ladder of an ensemble-based parallel tem¬ 
pered MCMC sampler (Goodman & Weare 2010) such that 
the rate of exchange between chains is uniform over the 
entire ladder. The algorithm is easy to implement in ex¬ 
isting code, and we provide an example implementation 
for the EMCEE sampler of Foreman-Mackey et al. (2013). 
We also present an implementation for traditional, non¬ 
ensemble MCMC samplers where a single walker explores 
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the parameter space. We report favourable results from such 
an implementation, along with a number of caveats. 

In Section 2 we describe the parallel tempering formal¬ 
ism and lay out the requirements for a good temperature 
ladder. We discuss previous work on temperature selection 
and suggest a definition of ladder optimality that, for simple 
cases, proposes a geometric spacing of temperatures. For il¬ 
lustration, we apply these ideas in Section 2.2 to the simple 
example of an unbounded Gaussian posterior distribution. 

In Section 3 we describe the algorithm mentioned above 
and then apply it in Section 4 to a variety of test distribu¬ 
tions. We show that, while our temperature selection strat¬ 
egy is not necessarily optimal in the ACT of the sampler, it 
nonetheless improves the ACT compared to the simple geo¬ 
metric spacing that is conventional in the literature (Earl & 
Deem 2005; Sugita & Okamoto 1999; Kofke 2002, 2004) by 
factors of > 1.2 for our test cases. 

In Section 5 we apply our method to the astrophysically 
motivated - and more challenging - problem of parameter 
estimation in the setting of gravitational wave (GW) data 
analysis, using a single-walker MCMC sampler. We demon¬ 
strate a reduction in ACT by as much as a factor of 2 over 
a geometric ladder, despite the caveats mentioned in Sec¬ 
tion 6.1. 

We conclude in Section 6 with a discussion of our results 
and suggestions for further research. 


2 PARALLEL TEMPERING 

Parallel tempering (Earl & Deem 2005; Swendsen & Wang 
1986; Geyer 1991) is a development on the standard MCMC 
formalism that uses several Markov chains in parallel to sam¬ 
ple from tempered versions of the posterior distribution 7r, 

n T (0) « m 1/T p(e), (1) 

where L and p are respectively the likelihood and prior dis¬ 
tributions. 

For high T, individual peaks in L become flatter 
and broader, making the distribution easier to sample via 
MCMC. A set of N chains is assigned temperatures in a 
ladder T\ < T 2 < ... < T/v, with T\ — 1 (the target temper¬ 
ature) . The temperatures are typically geometrically spaced 
from 1 up to some T max , decided in advance (a convention 
that we shall discuss in more detail in Section 2.2). 

Each chain is allowed to explore its tempered distribu¬ 
tion 7 tt under an MCMC algorithm, while at pre-determined 
intervals “swaps” are proposed between (usually adjacent 1 ) 
pairs of chains and accepted with probability 

* j=min {(M)’ (2> 

where 0i is the current position in the parameter space of 
the i th chain and Pi = 1/Ti is the inverse temperature of 
this chain. When a swap is accepted, the chains exchange 

1 In principle, swaps can be proposed between any pair of chains. 
However, since the swap acceptance ratio (2) decays exponentially 
with the separation of inverse temperatures, A/3, it is generally 
sufficient only to propose swaps between adjacent chains. 


their positions in the parameter space, so that chain i is at 
0j and chain j is at 0*. Since the hottest chains can access 
all of the modes of i r (as long as T max is chosen appropri¬ 
ately), their locations propagate to colder chains, ultimately 
allowing the T — 1 (cold) chain to efficiently explore the en¬ 
tire target distribution. At the same time, the positions of 
the colder chains propagate upward to higher temperature 
chains, where they are free to explore the entire prior vol¬ 
ume. 

The goal in choosing an effective ladder of temperatures 
is to minimise the ACT of the cold chain (our measure of 
the efficiency of the sampler). The requirements to this end 
are two-fold: 

(i) T max must be large enough that isolated modes of L 
broaden sufficiently that an individual MCMC chain can 
efficiently access all of these modes when sampling under 
the tempered posterior ttt in (1) at T = T max . We denote 
this temperature T pr i or . 

(ii) Since Aij depends on Pi — Pj , the differences be¬ 
tween temperatures must be small enough that neighbouring 
chains can communicate their positions efficiently with one 
another. 

Both requirements depend sensitively on the (unknown) 
shape of the target distribution, so it is difficult to select 
temperatures appropriately in advance. 

In choosing T max , one must know roughly the relative 
size and separation of the modes to be explored. As an exam¬ 
ple, consider a one-dimensional likelihood with two Gaussian 
modes of width a = 1 and centres p = ±10. In order to pre¬ 
vent a sampler from getting stuck on one of the modes, they 
must be widened to roughly the separation between them 2 , 
giving a = 0(10). The width of a Gaussian peak scales with 
the temperature as y/T, so we might choose T max = 100; 
Figure 1 illustrates the resulting coalescence of the modes. 
A different configuration of modes will, of course, require a 
different T max . 

On the other hand, the swap acceptance probability Aij 
depends on the distribution of likelihood values at temper¬ 
atures Ti and Tj. In the case of a likelihood distribution 
comprising a single Gaussian mode, the time-averaged ac¬ 
ceptance ratios between chains, E [Aij], can be computed 
analytically (see Section 2.2). 

In general, we don’t know in advance what the target 
distribution looks like, and so choosing an effective ladder 
becomes a heuristic exercise, relying largely on educated 
guesswork. We are therefore motivated to find some method 
of empirically determining an effective ladder. 

2.1 Ladder selection 

For an n-dimensional problem, the conventional choice of 
temperatures is a geometrically spaced ladder constructed so 
that approximately 23% of swaps proposed between chains 
will be accepted when sampling from an n-dimensional, 
unbounded Gaussian distribution (Earl & Deem 2005; ter 

2 Ideally, the modes must also be widened enough that they 
extend to the edges of the prior volume. A likelihood distribution 
with a single mode that occupies only a small fraction of the prior 
volume will take a long time to burn in. 
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Figure 1 . A one-dimensional target distribution with two Gaus¬ 
sian peaks of width a = 1 at p, = ±10 normalised for a uniform 
prior over [—20, 20]. At T = 100, the peaks broaden to a = 10, al¬ 
lowing an MCMC chain sampling at this temperature to find both 
modes quickly, starting from anywhere within the prior volume. 


Braak & Vrugt 2008; Roberts & Rosenthal 1998). We shall 
discuss this convention in more detail in Section 2.2. 

A consequence of this strategy is that increasing the 
number of chains N does not improve communication be¬ 
tween existing chains, which is determined by E[Aij] s= 0.23. 
Instead, adding new chains extends the ladder to higher tem¬ 
peratures. This may be appropriate for an unbounded pos¬ 
terior, but for a realistic problem with a finite prior volume, 
the acceptance ratio between adjacent chains saturates to 
~ 100% at some temperature T pr i or , at which the posterior 
7 it begins to look like the prior p. 

For this geometric spacing scheme - where T pr i or is un¬ 
known - there is therefore an optimal number of chains, 
N opt, such that T pr i or ~ -^A/opt = T max . For N <C. -/V op t none 
of the chains will be sampling from the prior (so the sampler 
may not find all of the modes), while for N > N op t we end 
up with several chains sampling redundantly from the prior. 

Since we are generally ignorant of T pr i or for the problem 
at hand, we are motivated to find an alternative temperature 
selection strategy. 

It has been suggested in the literature (Earl & Deem 
2005; Sugita &, Okamoto 1999; Kofke 2002, 2004) that one 
could select temperatures such that the acceptance ratios 
Aij are uniform for all pairs (i,j) of adjacent chains, in 
an attempt to ensure that each sample sequence 0(t) for 
t = 1,2,..., as it moves between chains, spends an equal 
amount of time at every temperature. Sugita & Okamoto 
(1999) justify this notion experimentally - in the context of 
molecular dynamics - with test cases in which such a lad¬ 
der indeed performs well. They use an algorithm derived 
from that of Hukushima & Nemoto (1996), which selects 
temperatures according to an iterative process for which a 
uniform-A ladder is a fixed point. Earl & Deem (2005) pro¬ 
vide further references for similar methods of determining 
temperature ladders that yield a given a target acceptance 
ratio (Rathore et al. 2005; Sanbonmatsu & Garcia 2002; 
Schug et al. 2004). However, these methods do not address 


requirement (i), discussed above, that the temperature lad¬ 
der should reach a T max sufficient for all of the modes of L 
to mix (specified by T pr i or ). 

Kofke (2002) discusses the selection of temperature lad¬ 
ders in the context of molecular simulations. He shows that, 
in simulations of such thermodynamic systems, there is a 
close relation between the specific heat of the system, CV, 
and the acceptance ratios between adjacent temperatures. 
In particular, when Cv is constant with respect to T over a 
given temperature interval, then a geometric spacing of tem¬ 
peratures on that interval yields uniform acceptance ratios 
between adjacent temperatures. 

In the language of thermodynamics, the energy of the 
system, U, is analogous to — logL, and an analogue to the 
specific heat can therefore be defined as 

Cv(T) = -^E[logL] T , (3) 

where E[*]t denotes the expectation operator over 9 under 
the distribution i E[log L]t is therefore the expectation 
of the untempered log likelihood collected when sampling 
from the posterior at temperature T. 

In the context of Bayesian inference, Kofke’s result 
therefore tells us that if the mean log likelihood collected by 
a sampler responds linearly to changes in temperature, then 
a geometrically spaced temperature ladder will achieve uni¬ 
form acceptance ratios between adjacent chains. Conversely, 
temperature intervals on which E[log L\t is strongly non¬ 
linear in T represent a phase transition that will require 
more careful placement of temperatures, as we shall show in 
Section 4. 


2.2 The ideal Gaussian distribution: a simple 
example 


In the simple case of a unimodal Gaussian likelihood under 
a flat prior, the optimal temperature spacing at low tem¬ 
peratures - where very little likelihood mass is truncated by 
the prior - can be analysed by approximating the prior to 
be unbounded 3 . We show that, for this tractable example, a 
geometric temperature spacing is consistent with both the 
uniform-A criterion and also with the alternative criterion 
that the Kullback-Leibler (KL) divergence is uniform be¬ 
tween ah pairs of adjacent chains. We use the example to il¬ 
lustrate the relationship between the analytical distribution 
of logL, the acceptance ratio Aij, and the temperature T. 

We shah work with an n-dimensional unit Gaussian cen¬ 
tred on the origin (the same result can be achieved for a 
general Gaussian through a simple change of coordinates). 
Since the prior is uniform and unbounded, we can restrict 
attention to the likelihood distribution L. In this case, the 
probability density p for the values of logL(0) collected by 
the sampler is 


p(log L) = 


e logL (_log Lp- 1 

r(f) 


( 4 ) 


3 The approximation breaks down at higher temperatures, where 
boundary effects become significant. Indeed, with no prior bound¬ 
aries, there is no T pr i or at which the mode is spread over the entire 
prior volume. 
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Figure 2. The distribution of logL under a three-dimensional, 
unimodal Gaussian at various temperatures, where L is nor¬ 
malised so that logL(O) = 0. As T —y oo, the variance of logL 
diverges. The legend is ordered to match the vertical order of the 
lines’ peaks. 


Figure 3. The time-averaged acceptance ratio, E[A], between 
two chains of a PTMCMC sampler on a unimodal, n-dimensional 
Gaussian likelihood distribution. The chains have temperatures 
T and 7 T. The lines are ordered vertically to match the legend. 


where L is normalised so that logL(O) = 0 and n is the 
number of parameters. 

At a temperature T, — log L simply follows a gamma 
distribution r(a, /3) with shape parameter a — n /2 and rate 
parameter /3 = 1/T. Thus, for a chain sampling at tem¬ 
perature T, the log likelihood distribution is pr(}ogL) = 
T p(logL/T). 

Over long time-scales, the average acceptance ratio be¬ 
tween chains i and j is 


E l A i,j] = jf AiJ PTi (log Li) pTj (log Lj ) dlog Li dlog Lj 


( — 00,o ] 2 

A 


= [ G— 


1 n n „ 1 , 


77^1 ( + ) | + 1 , 


( 5 ) 


where 2 T 1 is the regularised Gauss hypergeometric function 
and 7 +j = Tj/Ti is the ratio between the temperatures of 
two chains. Since E[A+j] depends on T% and Tj only through 
the ratio 7 +j, uniform acceptance ratios between ah adjacent 
pairs of chains can be achieved with a geometric spacing of 
temperatures - where 7 * 7+1 is constant - for a unimodal 
Gaussian likelihood. 

The log spacing required for a particular acceptance ra¬ 
tio also depends on the dimension of the parameter space, 
with more parameters requiring a closer spacing of temper¬ 
atures, illustrated by Figure 3. This can be understood by 
looking at the expectation and variance of log L at a partic¬ 
ular temperature (see Figure 2), 


E[log L\ t 


nT 

~Y 


and 


Var[log L\t 


nT 2 


( 6 ) 


Note that the specific heat from (3) is a constant n/ 2, as 
expected. 

Since the acceptance ratio Aij depends on logL^ — 
log Lj, the more separate the distributions of logL^ and 
log Lj at their respective temperatures, T* and Tj, the lower 
the acceptance ratio between such chains will be. For two 
chains at temperatures T and 7 T, the separation of the 
means of pr and p 7 T, in units of the standard deviation 
at T, will be 


E[logL]r-E[logL] 7 T ^ ^ fn _ ^ 

+Var[log L]t V 2 

It follows that - for constant 7 - as the dimension n 
increases, so the acceptance ratio between chains at tem¬ 
peratures T and 7 T falls. For a higher dimensional target 
distribution, therefore, a closer spacing of temperatures is 
required for a given acceptance ratio. 

For more general distributions, by considering the over¬ 
lap of pr(}ogL) at different temperatures, Falcioni & Deem 
(1999) argue that the number of temperatures N required 
to efficiently sample the posterior distribution should scale 
with A log L/y/n, where A logL is the range of E[log L]t 
between T — 1 and T = Tp r i or . That is: 


E[log L\\ -E[logL] Tprior 
\fn 


( 8 ) 


Since the log likelihood range A log L itself depends on 
the dimension of the system n, it is difficult to apply this 
relation in practice. However, for the ideal Gaussian, we can 
see from ( 6 ) that A log L scales with n, and so N scales with 
y/n, as we might expect. 


2.3 The Kullback—Leibler divergence 

Another measure of the optimal spacing of temperatures 
is the Kullback-Leibler (KL) divergence between adjacent 
chains. The KL divergence from a hot distribution tttj to a 
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cold distribution 7 ft* , 

[^(0) log ^4 d$, (9) 

J 7T Tj (0) 

quantifies the information gained about the posterior with 
each step down the temperature ladder, from the prior p = 
7Tt =00 to the posterior 7 r = 7 Tt=i- It is reasonable to expect 
that for an optimally-spaced ladder - that is, one with a 
minimal ACT on the cold chain for a given number of chains 
- the information gain should be uniform for every step down 
the ladder. 

For the example of the ideal Gaussian of Section 2.2, 
the KL divergence is, straightforwardly, 

^KL(7r T j7r Tj ) = - +log 7 ij - 1^ . (10) 

Like the swap acceptance ratio, therefore, uniform KL diver¬ 
gence over the entire ladder is also achieved by a geometric 
spacing of temperatures for the ideal Gaussian. 

Unfortunately, unlike the acceptance ratio, the KL di¬ 
vergence is difficult to compute numerically while sampling, 
owing to the unknown - and temperature-dependent - evi¬ 
dence (normalisation) values on 7 and 7 iTj . 

We henceforth assume that spacing temperatures for 
uniform acceptance ratios is a reasonable approximation of 
a ladder that is optimal in the ACT of the cold chain. We 
make this assumption on faith and, while we briefly examine 
its validity in Section 4.1 and Section 6.1, it invites a more 
careful study. 


3 ADAPTIVE TEMPERATURE LADDERS 

From the arguments in Section 2 and the references therein, 
we shall assume that uniformity of acceptance ratios pro¬ 
vides a good approximation to the optimal temperature lad¬ 
der for parallel tempering problems. In this section, we de¬ 
scribe an algorithm for dynamically adapting chain temper¬ 
atures to achieve uniform acceptance ratios for inter-chain 
swaps. 

From (2), as 1/7} — 1/7} —> 0, Aij —> 1, so in order 
to increase the expected acceptance ratio between chains, it 
suffices to move them closer together in temperature space; 
conversely, to reduce E[7bj], we can push the chains apart. 
We will henceforth adopt the notation that Ai = 1 

and that 7} < 7}+i, with 7\ = 1 being the untempered or 
cold chain (which samples from the target distribution, 7 r). 
Here, Ai(t) are the instantaneous acceptance ratios between 
chains, but we shall shortly describe the discrete case where 
empirical measurements of Ai are collected with each itera¬ 
tion of the sampler. 

3.1 Dynamics 

Our goal is to dynamically adjust the temperatures of the 
chains to achieve uniform acceptance ratios as we sample 
the target distribution. We define our temperature dynamics 
in terms of the log of the temperature difference between 
chains, 

Si = log(7} — Ti-i). (11) 


Under this scheme, finite changes to Si will always preserve 
the correct ordering of temperatures (7\ < ... < Tn). 

To achieve the same Ai for all chains, we can drive the 
gap Si according to the acceptance ratios between chain i 
and those immediately above and below, to wit 

^ = K(t) [Aft) - A i+1 (t)} , (12) 

for 1 < i < TV, where k is a positive constant controlling 
the time-scale of the evolution of 7}. k can be interpreted 
as the instantaneous exponential time-constant for temper¬ 
ature adjustments. The two extremal temperatures, 7\ and 
Tjv, are fixed (see below). 

Under this scheme, chain i will attempt to increase the 
gap in temperature space between itself and chain (i + 1 ) 
if swaps are accepted too often and close it when they are 
accepted too seldom — and similarly for chain {i — 1 ) — 
equilibrating at Ai that are uniform over i. Therefore, for an 
appropriate choice of Hi - discussed momentarily - these rules 
drive the chains i = {2,..., N — 1} toward even acceptance 
spacing. 

However, in order to efficiently sample a target distribu¬ 
tion with strongly separated modes (such that a traditional 
MCMC sampler would be unable to traverse the “valleys” 
between them), Tn must be high enough that the modes are 
flattened out and the chain can explore the entire parameter 
space unhindered. This amounts to the topmost chain sam¬ 
pling from the prior distribution 4 , which we achieve trivially 
by setting the inverse temperature of this chain as /3n = 0 . 

This continuous system is discretised as 

Si(t + 1) - Si(t) K(t) [Ai(t) - A i+1 (t)] , (13) 

where Ai(t) are the acceptance ratios accumulated by the 
sampler at the current iteration. 

The values of Ai are measured empirically at each it¬ 
eration as the fraction of swap proposals between chains 
that were accepted. For a traditional sampler comprising 
one sample per chain, these will be either 0 or 1. For en¬ 
semble samplers, however, comprising n w distinct walkers 
per temperature, the measurements of Ai are less granular, 
such that Ai G {x £ [0, l]\n w x G Z}. In general, fewer walk¬ 
ers require a longer averaging time-scale - discussed below 
- in order to smooth out this granularity. 

Importantly, the temperature adjustment scheme we 
have proposed - and, more generally, any adaptive sampling 
scheme - in fact violates the condition for detailed balance 
that ensures that an MCMC sampler will converge to the 
target distribution. Roberts & Rosenthal (2007) investigate 
the conditions required of such an adaptive sampler for it to 
be ergodic in the target distribution - that is, that it will 
converge on long time-scales. They determine (from their 
Theorem 1 and Corollary 4) that diminishing the amplitude 
of adaptations in the transition kernel with each iteration 
is sufficient for the sampler to be ergodic in the target dis¬ 
tribution. We therefore suppress temperature adjustments 
to ensure that the sampler is Markovian on sufficiently long 
time-scales 5 . 

The rate of diminution of temperature adjustments is a 

4 For analytic priors, this special case, where the likelihood is 
ignored, can be treated separately by having the sampler draw 
independent samples directly from the prior. 

5 In principle, of course, we could stop temperature adjustments 
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trade-off between the rate of convergence of the temperature 
ladder and that of the sampler itself toward its stationary 
distribution. We modulate the dynamics with hyperbolic de¬ 
cay to suppress the dynamics on long time-scales, 


where to is the time at which the temperature adjustments 
have been reduced to half their initial amplitude. The initial 
amplitude of adjustments is in turn set by v, the time-scale 
on which the temperatures evolve at early time. 

3.2 Parameter choice 

tn the scheme of (13) and (14), there are two parameters to 
choose: to and v. The dynamical time parameter t in (13) is 
measured in units of intra-chain jumps of the sampler, with 
temperature adjustments being made at every iteration. 

The lag parameter to sets the time-scale for the atten¬ 
uation of temperature adjustments. This decay factor in n 
is included as a fail-safe mechanism to ensure that, even 
for target distributions on which the temperature dynamics 
fail to find an equilibrium set of temperatures, the ladder 
will always converge over long time-scales. This condition 
guarantees that the sampler correctly explores the target 
distribution. 

From (14), the time-scale of the dynamics at late time 

- when t ^ to - is ut/to. To ensure that temperatures have 
time to find an equilibrium over the course of a run, we 
therefore require that to v, so that the dynamics will 
always be on a time-scale much shorter than the current 
run time. However, we should also ensure that, over the 
course of the run, the dynamical time-scale is longer than 
the ACT of the sampler, so that the temperatures respond 
to the correct posterior distribution. To this end, we require 
that vN t to, where N T is the number of independent 
samples gathered over the course of the run. For example, 
if N t = 100, these two conditions are satisfied by to = lOu, 
and for our test cases, we have indeed found this choice to 
work well. 

Meanwhile, the time-scale of the dynamics at early time 

- when t <C to - is v. A good choice of v should therefore 
ensure that the sampler is not susceptible to large statistical 
errors on the measurements of the acceptance ratios A*. 

In general, for n s swap proposals, the acceptance count 
n s Ai is a random variable that follows a binomial distribu¬ 
tion B(n s ,E[Ai\), so that Ai has variance 

= ElAKl-EM) . 

n s 

Since the dynamical equations (13) are linear in Ai, they 
will be driven by the means, E[A*], on long time-scales, as¬ 
suming that the noise in the system from counting errors - 
proportional to 1 /y/nl - does not cause short-term changes 
in E [Ai\. 

Given a sampler of n w walkers, n w swaps are proposed 
with each iteration, so that n s m n w v. To ensure stable 
dynamics at early time, we should therefore choose n w v 

1. 

altogether once the temperatures have reached an equilibrium, 
discarding the previous samples as part of the burn-in. 


A good choice of v depends on the response of E [Ai] to 
changes in the relevant chains’ temperatures, and therefore 
depends on the particular likelihood function that is being 
sampled. However, if E [Ai] will eventually be of order, say, 
0.25, and we want the measurements of Ai to be between 0.2 
and 0.3, then we should average Ai over at least 100 swap 
proposals, giving v > 100 /n w . 

Combining these criteria on v and to, we therefore sug¬ 
gest default parameter values of v — 10 2 /n w and to = 
1 0 3 /n w . 


4 EXAMPLES 

We have implemented the algorithm proposed above as a 
modification to the ensemble sampler emcee of Foreman- 
Mackey et al. (2013). Our implementation can be found at 
https://github.com/willvousden/ptemcee. 

In this section we apply our implementation to specific 
examples in order to understand how and when the tradi¬ 
tional geometric spacing fails and how much the uniform-A 
strategy might help us. We present the following test cases. 

(i) In Section 4.1 we compare the uniform-A strategy used 
by the temperature dynamics of Section 3 with the alter¬ 
native strategy of uniform KL divergence discussed in Sec¬ 
tion 2.2 on the example of a unimodal truncated Gaussian 
likelihood. 

(ii) In Section 4.2 we test the dynamics on a more com¬ 
plex, bimodal distribution for various choices of the num¬ 
ber of chains N. We compare the resulting ACTs of the 
sampler with those of another sampler using a geomet¬ 
ric ladder whose maximum temperature is fixed such that 
T max ~ Tp r i or . 

(iii) In Section 4.3 we test the algorithm against the more 
difficult egg-box distribution with 243 modes. For compari¬ 
son, we sample from the same distribution with a geometric 
ladder constructed to yield 25% acceptance ratios when ap¬ 
plied to the ideal Gaussian discussed in Section 2.2. 

For all of these tests, v — 10 2 and to — 10 3 are used 
to control the dynamics in (14), while the sampler uses 100 
walkers. Note that these choices, while different from the 
defaults proposed in Section 3.2, do satisfy the conditions 
described in that section. 


4.1 Truncated Gaussian 

Our first test case is an n-dimensional, unimodal, unit Gaus¬ 
sian similar to that of Section 2.2 but with finite prior vol¬ 
ume. The simplicity of this case admits some exact analysis 
before recourse to numerics, which allows us to test the ap¬ 
proximations made in Section 2.2. 

At low temperatures, where the prior boundaries do 
not truncate much of the likelihood probability mass, the 
optimal temperature spacing should be similar to that of 
the ideal Gaussian. By imposing a step-like cut-off in the 
prior at a radius of R , there will be some temperature at 
which this approximation will fail and a geometric spacing 
becomes inappropriate. 

For the likelihood we use the same distribution as in 
Section 2.2, while for the prior we use a uniform distribution 
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over the closed n-ball of radius R = 30, centred on the origin. 
The likelihood and prior are defined by 


Lid) ocexp(-§ ||0|| 2 ) , 
p(0) oc 


1 if \\0\\ < R, 
0 otherwise, 


(15) 

(16) 


where || • || is the Euclidean norm on R n . Subsequently, the 
normalised posterior generated by (15) and (16) at temper¬ 
ature T is 


7Tt(0) 


(2 ttT) 2r(f) 
iy2’2Tj 



0 


if \\6\\ < R, 
otherwise, 


(17) 


where y(a, z) is the lower incomplete gamma function. 

In the low-temperature limit, this distribution con¬ 
verges to the ideal Gaussian distribution. We should there¬ 
fore expect the KL divergence for a step down the tempera¬ 
ture ladder to asymptote to (10) as T 0, where the effects 
of the prior boundary are negligible 6 . Indeed, the KL diver¬ 
gence of (17) from T 2 to Ti is available analytically as 


Dkl = 


(T 2 — Ti) 7 (l • 


n R*_\ 
2 ’ 2 T 2 ) 


T 27 


(n R 2 5 

y 2 ’ 2 T 1 ) 


( T * 

+ 2 log \jT 




+ log 


(18) 


2 t 2 J 


( u R 2 \ 
(2’2 T x ) 


If we set T 2 = 7 T 1 (with 7 T 1 <C 1), then 7 (a,z) T(o) as 

Ti —> 0 , and the expression reduces to ( 10 ), as expected. 

Figure 4 illustrates this convergence for n — 5. The 
point on this plot at which the solid line diverges from the 
dashed line, for each 7 , predicts the temperature beyond 
which a geometric spacing of temperatures is no longer op¬ 
timal (for optimality as defined by uniform KL divergence 
between chains). This is caused by truncation of the tem¬ 
pered likelihood by the prior boundaries. 

Of course, since the KL divergence cannot easily be as¬ 
sessed empirically by an MCMC sampler, and we must in¬ 
stead resort to using acceptance ratios, we would like to 
know how consistent these two schemes are outside the as¬ 
sumptions of Section 2 . 2 . 

Figure 5 shows contours of constant Dkl, calculated 
from (18), and contours of constant A*, illustrated by points 
representing temperature pairs (from ladders selected by the 
algorithm developed in Section 3). In the low temperature 
limit, as expected, both schemes select a geometric spac¬ 
ing of temperatures consistent with the ideal Gaussian of 
Section 2.2 (i.e., the contours remain constant in 7 ). At 
higher temperatures, both schemes depart from the geomet¬ 
ric spacing, but they do so differently. The uniform accep¬ 
tance scheme displays a more gradual departure from a ge¬ 
ometric spacing than the contours of constant Dkl ■ The 
smaller 7 selected by the uniform-A scheme outside the ge¬ 
ometric regime, however, suggest that closer spacing is re¬ 
quired in difficult temperature ranges (e.g., across a phase 


6 While we do not consider T < 1 in our simulations, the case of 
T —y 0 can equivalently be thought of as R —> 00 , since the width 
of the Gaussian scales with y/T. 



a 



20 2^- 2 2 2 3 2 4 2 5 2 6 2 7 2 3 2 9 


Tlow 


Figure 4. The KL divergence, or information gain, from a hot 
chain at temperature 7 T\ ow to a colder chain at temperature Ti ow , 
both sampling from (17) at n = 5 (solid lines). As T\ ow —y 0, the 
information gain tends to that of the ideal Gaussian of Section 2.2 
(dashed lines). The lines are ordered vertically to match the leg¬ 
end. 


transition) in order to achieve uniform A than would be 
required for uniform Dkl . There is therefore less risk of a 
large gap in temperature across such a temperature range, 
at the cost of (potentially) slightly less efficient communica¬ 
tion across the rest of the ladder. The uniform-A criterion 
for optimality is therefore conservative with respect to a 
uniform- Dkl criterion. 

We can also visualise the ladder specification in terms 
of the density of chains over temperature. We define this 
density, in logT, as 

X('°<ST) = = ^7 = 7’ (19) 

with 7 = Ti+i/Ti, where T^+i and Ti are the chain temper¬ 
atures to either side of T. 

Figure 6 shows this density for a temperature ladder 
of 20 chains that is in equilibrium under the temperature 
dynamics of Section 3 (the N = 20 contour of Figure 5). 
The density exhibits the expected uniformity of 7 for low 
temperatures but falls for T > 80. The width a of the unit 
Gaussian at temperature T is y/T, so at this temperature the 
prior boundary is at ~ 3cr. At T = 80, ~ 5 % of the likelihood 
mass is truncated - compared to < 0.1% for T — 40 and 
~ 35% for T — 160 - indicating that the prior boundary 
becomes significant in this temperature regime. 

This drop in density reflects the convergence of the tem¬ 
pered posterior distribution, 7 tt, toward the prior as T — »> 00 . 
As 7 tt becomes flatter, fewer chains are needed per log T to 
maintain good communication. 

Also shown on figure Figure 6 is the square root of the 
estimated specific heat Cv of the system as discussed in Sec¬ 
tion 2 . 1 , which can be seen to track closely the logarithmic 
chain density 77 when appropriately normalised. While the 
provenance of this relationship is unclear, it demonstrates 
the relevance of the specific heat in determining an effective 
temperature ladder. 
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Figure 5 . A contour plot of the KL divergence, or information gain, from a hot chain at temperature Thigh — 7 Ti ow to a colder chain 
at temperature T\ ow , both sampling from the Gaussian likelihood (IT). The coloured lines show the equilibrium A/'-chain temperature 
ladders reached by the temperature dynamics algorithm of Section 3, where the acceptance ratio is the same between any pair of adjacent 
chains. The points on these lines represent pairs of adjacent temperatures (T, 7T) (excluding the top-most, where 7 = 00). 
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Figure 6. Orange: The density of chains per logT under the 
truncated Gaussian distribution (17), where N = 20, n = 25, 
and temperatures are chosen for uniform acceptance ratios be¬ 
tween chains. The chains have equilibrated to 77% acceptance. 
Blue: The square root of the specific heat of the truncated Gaus¬ 
sian distribution, normalised to match the chain density 77 of the 
uniform-A ladder, between T\ and Tjy- 1 - The specific heat CV, 
from (3), is estimated from the sample means of logL over many 
runs with different temperature ladders. 


4.2 Double Rosenbrock function 

The previous test demonstrated how a geometric ladder 
spaces temperatures too closely at higher temperatures, as 
the prior boundary becomes significant. While this may be 


an inefficient use of resources, it at least doesn’t drastically 
inhibit communication between high temperatures and low 
temperatures. Instead, we now turn to a more complex, bi- 
modal likelihood distribution for which a geometric spacing 
might cause bottlenecks in the communication between high 
and low temperatures. 

We use a likelihood derived from the two-dimensional 
Rosenbrock function /: 


L(x,y) oc 


1 


c+ /(*,«/) c + 


f(~x,y)J 


\ 1 /T p 


( 20 ) 


where 


f(x,y) = (a-x) 2 + b(y - x 2 ) 2 . ( 21 ) 

T p is a pre-tempering factor chosen to increase the con¬ 
trast of the distribution, making it harder to sample. When 
T p < 1 , each mode is locally Gaussian, making the re¬ 
sults comparable to the Gaussian example considered in Sec¬ 
tion 2 . 2 . 

For the following tests, we use a = 4, 6=1, c — 0.1, 
and T v — 10 -3 . We use a flat prior on [—10,10] x [—20,100]. 
Figure 7 illustrates this likelihood over the prior volume. 


4-2.1 Test: temperature evolution 

As an illustrative example, we first tested the temperature 
dynamics of Section 3 with the double Rosenbrock poste¬ 
rior distribution (20) using 13 chains. Figure 8 shows the 
evolution of the temperature ladder according to these dy¬ 
namics, while Figure 9 shows the chain density ? 7 (logT) for 
the equilibrated temperature ladder. 

While the equilibrated chains are distributed uniformly 
in logT for T < 50, there is a distinct peak in 77 at T « 800, 
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where a simple geometric spacing of temperatures hinders 
communication between chains. This peak occurs at a phase 
transition where the two modes of the likelihood distribution 
begin to mix and E[logL] changes rapidly with T, indicated 
by the sharp change in specific heat in the bottom panel of 
Figure 9. Since the shape of the likelihood distribution in 
this regime becomes very sensitive to T, a higher density of 
chains is needed to maintain a given acceptance ratio. We 
also note that in the geometric regime (i.e., for low T) the 
specific heat is approximately n/2 — 1 , with E[A] ~ 57%, 
consistent with the values derived for the ideal Gaussian 
from (6) and (5) respectively. 

Ultimately, however, the figure of merit for a tempera¬ 
ture specification in a PTMCMC simulation is the resulting 
ACT for the target temperature (T = 1) of the sampler. We 
must therefore test the performance of the sampler empiri¬ 
cally. 

We use the term ACT to refer to the integrated auto¬ 
correlation time discussed by Sokal (1997), which we esti¬ 
mate according to the algorithm used in the acor package 
(see Appendix A and http://www.math.nyu.edu/faculty/ 
goodman/software/acor/ for details). For the following 
tests, we use the ACT of the first parameter, x , as a measure 
of the efficiency of the sampler (since (20) is bimodal in x 
but unimodal in y ). 


4-2.2 Test: improvement over a geometric ladder 

In Section 2 we claimed that aiming for uniform acceptance 
ratios between chains yields a good temperature ladder. 
Specifically, we expect that a ladder selected for uniform 
acceptance ratios should lead to a lower ACT for the T — 1 
chain than that resulting from a plain geometric ladder. 

The geometric ansatz that we use has a fixed maximum 
temperature such that T/v = 2 x 10 4 . As iV increases, more 
chains are added between T± and Tn, maintaining the geo¬ 
metric spacing. Under this arrangement, the addition of new 
temperatures is not redundant even when Tn is already high 
enough to sample from the prior; the additional chains in¬ 
stead aid inter-chain communication at lower temperatures. 
Since Tn is close to the temperature at which the posterior 
becomes the prior, there is little CPU time wasted in sam¬ 
pling redundantly from the prior with several chains, while 


lower-temperature chains can still communicate with a chain 
sampling from the prior. Under this set-up, therefore, the 
ACT always decreases as N increases, per Figure 10. 

To test the improvement in ACT, r, conferred by our 
temperature dynamics, we allowed emcee to explore the 
target distribution (20) with different numbers of chains, 
N , using both the uniform-A ladders and geometrically 
spaced ladders. The resulting ACTs, r ge o and r ac c, are plot¬ 
ted against N in Figure 10. 

In this example, an iV-chain ladder dynamically 
adapted for uniform acceptance ratios clearly outperforms 
a geometrically spaced ladder of the same size for all N. 

The benefit of a uniform-A ladder is most pronounced 
at low N - i.e., where there are few chains available. In this 
regime, the sampler will be more sensitive to phase transi¬ 
tions, since the bigger gaps in temperature could cause se¬ 
vere bottlenecks in communication across the temperature 
ladder. 

When N is large, the differences in acceptance ratios 
between a geometric ladder and one chosen for uniform A 
becomes less significant. In this case, the difference between 
the limiting (minimum) acceptance ratio for a ladder and the 
ladder’s average acceptance ratio is proportionally smaller. 

In the case of the double Rosenbrock distribution (20), 
we have found that, once the minimum acceptance ratio 
for a geometric ladder (terminating at T max = 2 x 10 4 ) ex¬ 
ceeds ~ 10%, reallocating temperatures for uniform accep¬ 
tance ratios does not reduce the measured ACT by more 
than 25%. This occurs when N & 7 in the current example. 
Nonetheless, there remains an overall improvement in ACT 
regardless of N. 

Figure 10 also shows, in the middle pane, the total num¬ 
ber of iterations per independent sample across all chains. 
This quantity, given by N x r, is proportional to the total 
CPU time of the simulation, while r itself is proportional to 
the CPU time per chain, or wall time , of the simulation. In 
this instance, the CPU time of a run diminishes with N in 
much the same fashion as the wall time does. The fractional 
improvement in CPU time is of course the same as for wall 

time / Tgeo/'7'acc • 


4-2.3 Test: chain removal 

To determine whether a uniform-A temperature placement 
strategy is in fact close to optimal, we assess the contribu¬ 
tion of each chain from such a temperature ladder to the 
efficiency of the sampler, as measured by its ACT. If this 
contribution is equal for all chains, then we can conclude 
that it is indeed optimal to have them all exchanging equally 
- that is, with uniform acceptance ratios. 

To this end, we conducted the following test: 

(i) Sample from (20) with N = 7 chains under the tem¬ 
perature dynamics of Section 3 until the temperatures have 
equilibrated to (Ti,..., TV) to give uniform acceptance ra¬ 
tios. 

(ii) Generate 5 new test ladders, each of 6 chains, formed 
by removing the i th chain from that determined above - i.e., 
(Ti,..., Ti_i, Ti+i,..., TV) - for i — 2,..., 6. 

(iii) Sample from (20) with each of these 5 test ladders 
and calculate the ACTs on the cold chain, r te st- 
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Figure 8. The evolution of ladder of 13 temperatures Ti and acceptance ratios Ai over an emcee run of 10 6 iterations under the 
Rosenbrock likelihood (20). Chains 1 and 13 are not shown, having fixed temperatures T± = 1 and T 13 = 00 . 
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Figure 9. Orange: The equilibrium density of chains per logT 
for the double Rosenbrock run illustrated in Figure 8 , where the 
acceptance ratios have settled to Ai ~ 0.57. Blue: The square 
root of the specific heat for the double Rosenbrock distribution, 
as described in Figure 6 . 


Figure 11 shows the ACTs for the cold chain result¬ 
ing from the test outlined above. While r acc < Rest < T geo 
for ladders of the same IV, r te st increases with the tempera¬ 
ture of the chain that is removed, suggesting that additional 
chains are more useful at higher temperatures. The sharp 
jump in rtest when a chain above T « 200 is removed arises 
from the phase transition that occurs as T approaches T pr i 0 r, 
indicated by a peak in Cv (visible in Figure 9). 


We can understand this behaviour by examining the 
complete autocorrelation functions from which these ACTs 
are estimated. Illustrated in Figure 12, these autocorrelation 
functions exhibit two distinct time-scales. Firstly, there is a 
large autocorrelation for lags <100 for all i - particularly 
i — 2 - corresponding to the ACT of the sampler within 
one of the two modes: that is, the time taken for the sam¬ 
pler to generate an independent sample without changing 
mode. Secondly, there is a visible hump in the autocorre¬ 
lation function for 100 < lag < 2000, corresponding to the 
time taken for the sampler to migrate between modes. Re¬ 
moving the second chain from initial geometric ladder of 7 
chains increases the intra-mode ACT in particular, but does 
not affect the inter-mode ACT. Meanwhile, while removing 
higher temperature chains pushes the secondary hump out¬ 
ward to larger lags, increasing the inter-mode ACT instead. 

The overall autocorrelation time in which we are inter¬ 
ested, discussed by Sokal (1997) and in Appendix A, rep¬ 
resents the time between independent samples of the sys¬ 
tem. It is therefore set by the time-scale on which the sam¬ 
pler migrates to a new mode independently of the current 
mode. Removing a chain at higher temperatures increases 
the inter-modal ACT, and therefore damages the efficiency 
of the sampler. 

Nonetheless, all of the tested temperature ladders 
yielded lower ACT than the default geometric ladder, de¬ 
spite the geometric ladder being chosen with prior knowl- 
edge of 7p r ior- 


4.3 Egg-box in five dimensions 

To test the algorithm’s performance on a yet more strongly 
multi-modal distribution, we use an egg-box distribution de- 
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Figure 11. The cold-chain ACTs for samplers exploring the 
double Rosenbrock distribution (20) per the test described in Sec¬ 
tion 4.2.3. The points denote the ACTs from ladders generated 
according to the scheme in Section 4.2.3. The dashed lines above 
and below identify the ACTs from geometric and uniform-A lad¬ 
ders, respectively, of N = 6. 


Figure 10. Top: the ACTs of x for the cold (T = 1) chain of a 
sampler exploring the double Rosenbrock distribution (20), using 
both uniform-A and geometrically spaced temperature ladders as 
a function of the number of chains N. Middle: the total CPU 
time, N X r, for the runs. Bottom: the relative improvement in 
the ACT for the uniform-A ladder over the geometric ladder. The 
joining lines are provided to guide the eye. 

fined by the likelihood 

/ n \ 1/T * 

m * u n cos Oi + ^ I . (22) 

For a small value of the pre-tempering factor T p the modes 
of this distribution become locally Gaussian, and in the low- 
T regime should therefore generate results similar to those 
of the Gaussian distributions examined in Section 2.2 and 
Section 4.1. For the following tests, we choose T p = 10 -3 . 

We explore this likelihood distribution in 5 dimensions 
over a hat prior on [—L/2,L/2] n , where we choose L = r, 
giving 3 n = 243 modes. 

Rather than compare our uniform-A temperature lad¬ 
der against a geometric ladder with a fixed maximum tem¬ 
perature, as in Section 4.2, we instead use a geometric ladder 
constructed to give a fixed acceptance ratio of E[A] = 0.25 
when applied to the special case of an ideal Gaussian like¬ 
lihood (per Section 2.2). Such a ladder will not, in general, 
give uniform acceptance ratios when applied to an arbitrary 
posterior distribution, but this choice reflects the more re¬ 
alistic scenario where we cannot guess at T pr i or , and so we 
resort to assuming that the distribution indeed behaves like 
an ideal Gaussian. 

Figure 13 shows the evolution of the temperatures and 


acceptance ratios for an EMCEE sampler of 15 chains under 
the temperature dynamics of Section 3. Figure 14 shows the 
equilibrium density ^(logT) after the ladder has achieved 
uniform acceptance ratios. 

Figure 15 shows the ACTs of the cold chain (T = 1) un¬ 
der uniform-A and geometric ladders for the 5-dimensional 
egg-box problem as a function of the number of tempera¬ 
tures available. In this case, adding more temperatures to 
a geometric ladder does not reduce the measured ACT of 
the sampler for N > 7, since they are added to the high-T 
end of the ladder, above T pr i or , and the ratios between lower 
temperatures do not change. Figure 13 shows that from the 
initial geometric ladder only around 6 chains are within the 
range of temperatures spanned by the equilibrium ladder; 
the remaining 8 (excluding T\ — 1 and T/v = oo) are ah 
above T pr i or and effectively sample from the prior. In this 
case, therefore, the geometric spacing that would give uni¬ 
form acceptance ratios of 25% for an ideal Gaussian in fact 
spaces temperatures too widely for > 6 chains. 

Meanwhile, adding more chains to a dynamically 
adapted ladder clearly reduces the ACT of the sampler 
in this regime. Moreover, the ACT of a sampler using a 
uniform-A ladder is always lower than that of a sampler 
using the geometric ladder of the same N. In the egg-box 
example, which requires a relatively close spacing of tem¬ 
peratures, the improvement is dramatic when many chains 
are used: r ge o > 2r aC c for N > 12. 

The failure of the geometric ladders used in this exam¬ 
ple for N > 7 lies in the poor T max chosen by assuming 
that the distribution behaves like an ideal Gaussian. A ge¬ 
ometric spacing is in fact appropriate for a large portion of 
the temperature range, but its efficacy relies on the ladder 
terminating at the correct T pr i or - 

When N < 7, the geometric and uniform-A ladders 
show similar ACTs, and the geometric ladders in fact do 
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Figure 12. The cold-chain autocorrelation function for a sam¬ 
pler exploring the double Rosenbrock distribution (20). The solid 
lines correspond to the ladders generated by the scheme outlined 
in Section 4.2.3, where i is the index of the removed chain. For 
comparison, the dashed and dotted lines represent respectively 
uniform-A and geometric ladders of the same size. The approx¬ 
imate ACTs are 504, 531, 551, 736, and 765, for i = 2,..., 6; 
467 for a uniform-A ladder; and 844 for a geometric ladder (see 
Figure 11). 


slightly better. While unexpected, this is a consequence of 
the behaviour of the affine invariant ensemble sampler used 
in EMCEE (Foreman-Mackey et al. 2013; Goodman V Weare 
2010) as applied to the egg-box likelihood (22). When such 
a sampler is applied to a target distribution for which the 
number of modes n m is greater than the number of walkers 
n w used by the sampler, it behaves as though it is sampling 
from the prior (albeit inefficiently). There is therefore little 
benefit in having a chain sampling as high as T pr i 0 r, and so 
it is better - in terms of the ACT - to assign more chains 
to lower temperatures in order to increase their acceptance 
ratios. In our case, the egg-box likelihood has 243 modes in 5 
dimensions, while the sampler uses only 100 walkers, so these 
walkers tend to become isolated from one another. Since 
the sampler relies on clustering of walkers on an individual 
mode to inform jump proposals within that mode, jumps are 
instead proposed between modes when there are on average 
fewer than one walker per mode. 

We anticipate that running the same tests on a tradi¬ 
tional single-walker MCMC sampler, or reducing the num¬ 
ber of modes of the likelihood distribution so that n w n m , 
will dramatically increase T geo /T acc in the low temperature 
regime. We should expect that r ge0 r ac c when T max (iV) <C 
^prior for the geometric ladder and that r ge 0 ~ T acc when 
T max (iV) « TpHor. 

Figure 15 therefore illustrates a very specific case for 
N < 7 that does not reflect the importance of choosing 
T max ~ Tp r i or . Nonetheless, the ACTs of the two tempera¬ 
ture allocation strategies - geometric and uniform-A - are 
still fairly similar for N < 7 and there is a distinct improve¬ 
ment for N > 7. 


5 GRAVITATIONAL WAVE SIGNALS FROM 

COMPACT BINARY COALESCENCES 

We now present an example application of the dynamics of 
Section 3 to a challenging and computationally expensive 
astrophysical inference problem. 

The first direct detections of gravitational waves (GWs) 
by ground-based GW detectors comprising the advanced 
LIGO and Virgo observatories (Aasi et al. 2015; Acernese 
et al. 2015) are anticipated within this decade. Among the 
most promising candidate sources for these detectors are 
extragalactic compact binary coalescences (CBCs) (Abadie 
et al. 2010): the energetic inspiral and merger of pairs of 
neutron stars and/or black holes, emitting a well-modelled 
GW signal. 

For GW astrophysicists, recovery of the source param¬ 
eters of such a signal from the GW signature as observed by 
the detectors is a significant challenge in the application of 
Bayesian statistics (e.g., van der Sluys et al. 2008a,b; Ray¬ 
mond et al. 2010; Rodriguez et al. 2014; Aasi et al. 2013; 
Vitale et al. 2014; Singer et al. 2014; Veitch et al. 2015). 
Parallel-tempered MCMC is one method used within the 
LIGO Scientific Collaboration (LSC) to perform this infer¬ 
ence, so parameter estimation for CBC detections presents 
an ideal test for the scheme outlined in Section 3. 


5.1 Parameter estimation for gravitational waves 

We aim to recover the Bayesian posterior probability density 
7r(0) for the parameters 6 of a merging binary system from 
which a GW signal has been detected, given some detector 
data s and priors p(0). 

The likelihood function L(0; s) for this problem is a 
function of 


(i) the detector output s, and 

(ii) a model waveform h(0 ) representing a putative GW 
signal, 

such that s — h(0 ) + n, where n is the noise in the detector 
and s, /i, and n are time series with Fourier transforms s, h, 
and n respectively. 

Given a noise model for the detector, n is then a random 
variable with a known distribution and L(0; s) can be defined 
in terms of h = s — h(0). Under a simple stationary Gaussian 
noise model this likelihood can be expressed as 


log L{6\ s) = -- 


1 [° 

2 y_ ( 


k'(/)-fe(/;fl)l 2 

Sn(f) 


d f + C, 


(23) 


where C is a normalising constant and S n is the two-sided 
noise power spectral density (PSD). 

The approximate CBC waveform h that determines this 
likelihood has between 9 and 15 physical parameters in 
6. These parameters can be partitioned into two disjoint 
groups; firstly, the intrinsic parameters are those that deter¬ 
mine the dynamics of the source itself: 


• q and M\ the mass ratio and chirp mass of the binary, 
system 7 , 


7 This is an alternative parameterisation of the mass configura¬ 
tion of a binary system with component masses mi and m 2 , such 
that q = m 2 /mi and A4 = (mim^) 3 / 5 /(mi + m^) 1 / 5 . 
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Figure 13. The evolution of temperatures T* and acceptance ratios Ai while sampling with emcee from a 5-dimensional egg-box 
distribution, ( 22 ), with 15 chains. Chains 1 and 15 are not shown, having fixed temperatures T\ = 1 and T 15 = 00. 
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• a and S: the right ascension and declination of the event, 

• ^ and 0 jn: the two angles describing the orbital orien¬ 
tation of the binary system, and 

• t c and (j ) c : a reference time and phase for the waveform. 

A binary system with arbitrary spinning component 
masses is therefore described by 15 parameters. A binary 
system whose spins are aligned with its orbital axis is mod¬ 
elled by 11 parameters, neglecting the spin orientation an¬ 
gles, and a non-spinning system is modelled by only 9 pa¬ 
rameters, with the spin magnitudes also omitted. 

This likelihood function and its parameter space gen¬ 
erate a highly structured posterior distribution with many 
modes and degeneracies that is difficult for conventional 
MCMC samplers to explore. 


Figure 14. Orange: The equilibrium density of chains per logT 
for the egg-box run illustrated in Figure 13, where the acceptance 
ratios have settled to Ai ~ 0.65. Blue: The square root of the 
specific heat for the egg-box distribution, as described in Figure 6 . 

• a\ and a<i\ the spin magnitudes of the binary compo¬ 
nents, and 

• £i, £2, 0 jl, and <£12: the four angles describing the spin 
orientations of the binary components. 

Secondly, the extrinsic parameters are those that deter¬ 
mine only the waveform observed at the detector: 

• (II : the luminosity distance between the binary system 
and the detector, 


5.2 Dynamic parallel tempering 

We implemented the scheme of Section 3 in the MCMC 
sampler used by LAL Inference: the software suite used 
by the LSC for GW parameter estimation (Veitch et al. 
2015). Unlike emcee, the LALInference sampler uses 
only one walker, with jump proposals that are tuned to 
the structure of the posterior distribution generated by a 
CBC signal. Veitch et al. (2015) present a comparison of 
LALInference’s standard parallel tempered MCMC im¬ 
plementation with other sampling techniques, such as nested 
sampling (Skilling 2006; Veitch V Vecchio 2010). They find 
MCMC to be at least competitive with the other methods 
in both CPU time and wall time, and in some cases better. 

To compare our implementation of Section 3 under 
LALInference with the default geometric temperature 
ladder, we tested both schemes on a number of synthetic 
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Figure 15. Top: the ACTs of the cold chain (T = 1) of a sam¬ 
pler exploring the egg-box likelihood (22) with ladders of different 
sizes N , for both geometric temperature ladders and ladders dy¬ 
namically adapted for uniform acceptance ratios. Middle: the 
total CPU time, N x r, for the runs. Bottom: the relative im¬ 
provement in ACT conferred by dynamically adapting for uniform 
acceptance ratios over a geometric ladder. 


GW events simulating the signals received from two differ¬ 
ent compact binary sources. 

We conduct our tests against two non-spinning pro¬ 
totype GW sources: a binary neutron star (BNS) system 
and a binary black hole (BBH) system, detailed in Table 1. 
For each of these prototypes, we simulate coherent detec¬ 
tions by a network of GW detectors, for a range of net¬ 
work signal-to-noise ratios (SNRs), by injecting the com¬ 
puted GW signal into mock Gaussian noise generated from 
the noise PSDs of each detector. We simulate a network 
comprising the Advanced LIGO detectors in Hanford, Wash¬ 
ington and Livingston, Louisiana and the Advanced Virgo 
detector in Cascina, Italy, using noise PSDs that approxi¬ 
mate the detectors’ design sensitivities (The LIGO Scientific 
Collaboration 2010; The Virgo Collaboration 2009). 

The SNR p of a GW detection is a proxy for the maxi¬ 
mum log likelihood, such that AlogL scales as p 2 / 2, where 
A log L is the difference between the maximum log likelihood 
under the signal model and the likelihood of the noise-only 
(h = 0) model. The SNR therefore indicates how sharply 
peaked the posterior distribution will be. Since the SNR of 
a detection can be estimated by the detection pipeline, it 
can also be used to decide the T max used in constructing 
a geometric temperature ladder for that run, against which 
we compare a uniform-A ladder. 


We attempt to recover the parameters of the injected 
events with the likelihood function (23), using two families 
of frequency-domain waveform approximants: 

(i) TaylorF2 , which describes with 9 to 11 free parameters 
the post-Newtonian inspiral of two masses, optionally with 
spins aligned with the orbital axis (Buonanno et al. 2009), 
and 

(ii) IMRPhenomP , which describes the full inspiral- 
merger-ringdown sequence of a CBC, allowing for arbi¬ 
trary precessing spins and having 15 free parameters (in the 
LALInference implementation) (Hannam et al. 2014). 

When recovering with TaylorF2 , we allow for aligned spins 
in the system, while for both approximants, we analytically 
marginalise the reference phase <p c out of the likelihood. For 
our runs, therefore, the TaylorF2 approximant generates a 
10-dimensional parameter space, while IMRPhenomP gen¬ 
erates a 14-dimensional parameter space. 

The posterior distribution for one of these problems, a 
BNS binary recovered with TaylorF2 at an SNR of 25, is 
illustrated in Figure 16 and Figure 17. These show the one- 
and two-dimensional marginal distributions of the recovered 
samples, partitioned into intrinsic and extrinsic parameters. 
Some parameters, such as the chirp mass M, are very accu¬ 
rately measured, while others show multiple modes (e.g., the 
polarisation angle ip) or strong correlations (e.g., distance c?l 
and inclination 0 jn). 

Figure 18 shows the effect of the SNR on the equilib¬ 
rium (uniform-A) chain density p selected by our dynamical 
scheme. While the structure of the temperature ladder is 
preserved, its features scale to higher temperatures as the 
SNR of the injected signal increases while the average value 
of p falls. Since the maximum log likelihood scales with the 
square of the SNR, it follows that, under a fixed prior, T pr i 0 r 
should also increase with the SNR. 

Meanwhile, Figure 19 shows the ratios of ACTs for runs 
using uniform-A ladders versus those using geometric lad¬ 
ders. The lowest SNR that we simulate, 10, represents a 
signal that is on the threshold of detectability, where we 
expect most detections to occur, while the maximum, 25, 
represents a relatively loud signal (at around the 90 th per¬ 
centile of detectable events). 

While there is significant variation in the ACT mea¬ 
surements between SNRs, there is on average a reduction in 
ACT of 26% for the systems and SNRs tested. In general, 
a uniform-A ladder is at least as effective as a geometric 
ladder in ah cases; that is, the ACT ratio T ge o/T aC c is never 
less than one (within error bars). In some cases, this ratio is 
appreciably greater than one, e.g., for low-SNR BNS events. 

However, as we shall discuss briefly in Section 6.1, the 
single-walker nature of the LALInference sampler inhibits 
communication between the extremal chains, so that the 
temperatures are in fact partitioned into two independent, 
non-communicating groups. The improvement we observe in 
Figure 19 therefore arises in fact from more efficient alloca¬ 
tion of the temperatures below the critical temperature of 
the phase transition. Meanwhile, those chains above the crit¬ 
ical temperature - which are sampling in the regime where 
the noise-only model is preferred over the presence of a GW 
signal - remain isolated. 

We anticipate that, with an ensemble sampler with 
similar problem-specific optimisations to those used by 
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LALInference, the communication barrier at the phase 
transition could be removed and we may observe greater 
improvements in ACT and sampling efficiency. 


6 DISCUSSION 

The temperature selection scheme set out in Section 3 solves 
two problems in the application of parallel tempering: 

(i) It identifies T max = oo as a suitable temperature for 
the hot chain - such that it will sample from the prior - that 
is independent of the target distribution. 

(ii) It allocates a fixed number of intermediate tempera¬ 
tures to ensure good communication between fixed extremal 
temperatures T m i n and T max , and therefore efficient sam¬ 
pling of the target distribution - i.e., with few iterations 
between independent samples. 

The intermediate temperatures are allocated so that ac¬ 
ceptance ratios for swaps proposed between neighbouring 
pairs of chains are uniform across the temperature ladder. 
The dynamical algorithm that implements this scheme re¬ 
quires only two parameters: v and to- These parameters, 
discussed in Section 3.2, describe only the initial dynamics 
of the temperatures, setting the time-scale for temperature 
adjustments, and do not determine the equilibrium uniform- 
A ladder. 

While a temperature configuration that is selected for 
uniform acceptance ratios between all chain pairs is not nec¬ 
essarily optimal in the ACT of the sampler, we have demon¬ 
strated that it is generally better than a conventional geo¬ 
metric temperature configuration and provides more consis¬ 
tent behaviour across different likelihood distributions and 
numbers of chains. Importantly, the dynamics that achieve 
such a temperature ladder are simple and easily imple¬ 
mented, requiring very little tuning or intervention. 

The factor by which the ACT is reduced by the uniform- 
A scheme depends strongly on the likelihood distribution 
that is explored and on the specific geometric ladder against 
which the uniform-A scheme is being compared. For a geo¬ 
metric ladder, one must make an ad hoc choice of the max¬ 
imum temperature T max ; this is difficult and a poor guess 
can yield a very sub-optimal ladder. In particular, if T max is 
not high enough that the sampler can efficiently migrate be¬ 
tween modes, then the ACT will be significantly higher than 
it needs to be. On the other hand, if T max is too high, then 
many of the chains will effectively sample from the prior, 
and CPU time will be wasted in sampling from redundant 
tempered likelihood distributions. 

The uniform-A temperature dynamics guarantee that, 
for a given number of chains A, no such wastage of CPU 
time occurs and that there will always be precisely one chain 
sampling at T max = oo (i.e., sampling from the prior). Tests 
of the dynamics generally demonstrate lower ACTs when 
compared with geometric temperature ladders of the same 
number of chains, A. 

In Section 4.2.2 we demonstrated that, even with a ju¬ 
dicious choice of T max that is close to T pr i 0 r, a traditional 
geometric ladder is outperformed by a ladder chosen for 
uniform acceptance ratios (with T max = oo). Figure 10 il¬ 
lustrates that, when T pr i 0 r is known, a uniform-A ladder 
confers the greatest reduction in ACT when A is small. In 


this case, the temperature ratio 7 of the geometric ladder 
is large enough that phase transitions in the distribution of 
log L cause a bottleneck in the communication between hot 
and cold chains around a critical temperature, where A< 1 . 
The uniform acceptance scheme allocates more chains over 
these temperature regimes in an effort to optimise the com¬ 
munication. 

For larger A, Tg eo /r acc ~ 1, suggesting that - as long 
as there are no pairs of chains with prohibitively low swap 
acceptance ratios - a geometric spacing is adequate if T max 
is chosen appropriately. 

It is unclear how to determine the threshold A below 
which communication is impeded, but it is likely related 
to the time-scale of the intra-chain motion of the sampler. 
If intra-chain jumps are accepted seldom with respect to 
the rate of inter-chain swaps, then increasing the inter-chain 
swap acceptance ratio is unlikely to make the sampler any 
more efficient. 

In general, while r ge o > T acc for all A", the improve¬ 
ment fraction T geo /T a cc will asymptote to 1 as A —)> 00 . The 
rate of decay will depend strongly on the target distribu¬ 
tion. A system with a wide distribution of logL (e.g., with 
many dimensions) or with sharp phase transitions at certain 
temperatures (e.g., with many modes of various shapes and 
weights) will see the most benefit from having many chains, 
while a better-behaved distribution without such features 
can be efficiently sampled with fewer. 

Meanwhile, from our tests on the 5-dimensional egg-box 
distribution discussed in Section 4.3, we can see the conse¬ 
quences of a poor choice of T max . While the egg-box dis¬ 
tribution does not have as strong a phase transition as the 
double Rosenbrock function of Section 4.2, our ignorance 
of T pr ior means that a geometric ladder (which in this case 
is constructed from a fixed temperature ratio 7 ) is mostly 
worse than a uniform-A ladder. Figure 15 demonstrates this, 
specifically when A is large enough that for a given 7 the ge¬ 
ometric ladder places many temperatures redundantly above 
T pr ior. In this case, we see a dramatic improvement in ACT 
r from using a uniform-A ladder when compared with a ge¬ 
ometric ladder of the same number of chains A; indeed, the 
ratio T g eo/T a cc becomes as large as ~ 4 for the values of 
A tested. Since T geo is independent of A when A > 7, we 
should expect that this ratio will saturate as A —»• 00 , where 
T aC c reaches a minimum. ]\4oreover, the CPU time, ./V x t of 
the uniform-A runs continues to decrease with A in the ex¬ 
plored range, even as the CPU time of the geometric runs 
rises. 

On the other hand, when A is too small for a geomet¬ 
ric ladder to reach the prior (i.e., Tn <C T pr i or ), we notice 
that in fact T geo < T acc - As discussed in Section 4.3, this 
somewhat surprising result arises from a limitation of the 
ensemble sampler that was used to sample the distribution. 
We anticipate that if the number of walkers were increased 
to many times the number of modes - which is required for 
efficient sampling - the geometric ladder will fail dramati¬ 
cally in this regime of A, giving r geo > T acc . 

Finally, our astrophysical application in Section 5 illus¬ 
trates the use of the scheme of Section 3 in a single-walker 
setting. In this case, a uniform-A ladder improves on the 
default geometric ladder used by LALInference, giving 
on average a 26% reduction in ACT. These tests do, how- 
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Table 1. The CBC event prototypes used to test the adaptation scheme of Section 3. All prototypes are simulated at distances that 
yield 5 different SNRs: 10, 11, 15, 19, and 25. 


Source 

Injection waveform 

q 

M ( Mq ) 

Recovery waveforms 

BNS 

SpinTaylorT4 

0.970 

1.30 

TaylorF2 

BBH 

IMRPhenomP 

0.996 

4.82 

TaylorF2 , IMRPhenomP 
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Figure 16. The one- and two-dimensional marginal distributions of the intrinsic parameters of a BNS event with SNR 25. Note, in 
particular, the very accurate measurement of the chirp mass A4 (the plotted range is only ~ 0.1% of the true value). 


ever, reveal an instability in the uniform-A dynamics under 
a single-walker sampler, which we describe below. 


6.1 Single-walker implementation 

The dynamical algorithm set out in Section 3 can be imple¬ 
mented in a traditional MCMC sampler that uses only a sin¬ 
gle sample per chain, with reductions observed in the ACT 
of the sampler relative to a geometric temperature ladder, 
as shown in Section 5.2. However, in this case, equal (and 
large) acceptance rates between all chains do not guarantee 
good communication of sample positions between extremal 
temperatures. 

A complex, multidimensional posterior such as that de¬ 
scribed in Section 5.2 may exhibit a phase transition. At 
T — 1, the posterior is dominated by the likelihood peaked 
around the true parameter values; at T = oo, it is domi¬ 


nated by the much larger prior volume far away from the 
parameter values, corresponding to a weak or absent signal; 
and at the phase transition temperature, the two contribute 
comparably to the posterior, which is distinguished by two 
distinct likelihood peaks: a high-likelihood peak near the sig¬ 
nal parameters, and a low-likelihood peak in the region of 
significant prior support. In effect, this becomes a reversible- 
jump MCMC problem with two distinct modes: the high- 
likelihood signal model, and the low-likelihood noise-only 
model. 

The dynamical algorithm with a single walker has a ten¬ 
dency to select very small temperature gaps around phase 
transitions. From (2), when A/3 — t 0, A —y 1 regardless of the 
likelihoods of the chains. However, the likelihood distribu¬ 
tions near the phase transition temperature will be distinct 
enough that there is no intra-chain migration of samples 
between them. Consequently, despite efficient communica- 
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Figure 17. The one- and two-dimensional marginal distributions of the extrinsic parameters recovered with TaylorF2 from a BNS event 
with SNR 25. Note the multiple modes for the polarization angle ip and the strong correlation between distance d l and inclination #jn- 


tion between chains, the higher temperatures do not help 
low-temperature walkers to efficiently jump between the two 
modes, since the high- and low-likelihood modes do not mix 
well at any temperature. 

Figure 20 demonstrates how large acceptance ratios do 
not guarantee the transmission of sample positions between 
low and high temperatures when there is only one sample per 
temperature, in this case, 9 samples (whose paths are shown 
as dotted lines) occupy the high-log L part of the parameter 
space representing the signal model, while the remaining 
3 (solid lines) occupy the low-log L part, representing the 


noise-only model. This barrier occurs at the temperature at 
which the evidence for the noise-only model is equal to that 
of the signal model. 


We did not encounter this issue with an ensemble sam¬ 
pler, which is less susceptible to this instability because it 
allows the ensemble at each temperature to occupy both 
models simultaneously. 
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Figure 18. The equilibrium density of chains per logT, from 
(19), for the TaylorF2 BBH runs described in Section 5.2 at var¬ 
ious SNRs. 



P 


Figure 19. The fractional improvements in ACT conferred by 
a uniform-A temperature ladder over a geometric ladder for the 
CBC parameter estimation problem described in Section 5.2 at 
various SNRs. 


6.2 Evidence calculations 

The current paper focuses mainly on the efficiency of a par¬ 
allel tempered MCMC sampler in producing independent 
samples from its target distribution. Another important task 
in Bayesian statistical inference is to compute the evidence 
integral of the posterior distribution. At a given tempera¬ 


ture, this is given by 

Z{P) = j L{e/p(e) d0, (24) 

where [3 = 1/T is the inverse temperature. 

Since we are interested in the untempered posterior, 
we wish to calculate Z( 1). From (24), we can use thermody¬ 
namic integration (Goggans & Chi 2004; Lartillot & Philippe 
2006) to express the log evidence (relative to the prior) in 
terms of the mean logL, such that 

A log Z = log Z(1) — log Z(0) = [ E[log L]^ d/3, (25) 

Jo 


to which the logarithm of the integral of the prior, log Z( 0), 
can be added to give the absolute evidence logZ(l). 

The log evidence can therefore be computed by a sam¬ 
pler through numerical integration of the mean log L values 
collected over all of the chains. In the same way that inter¬ 
chain communication is hindered by phase transitions in the 
system, numerical estimation of this integral is susceptible to 
sharp changes in log L with the temperature T. Such phase 
transitions are marked by a diverging specific heat CV since, 
from (3), Cv is the derivative of logL with respect to T. 

Since allocating temperatures for uniform acceptance 
ratios yields a logarithmic chain density r] that appears to 
scale with y/Cv, such a temperature ladder will naturally 
increase the accuracy of numerical estimates of (25) with 
respect to one that does not increase r) around phase tran¬ 
sitions. 

We can test the degree of improvement conferred by 
a uniform-A ladder by returning to the truncated Gaus¬ 
sian discussed in Section 4.1. Normalising (17) so that 
max log L = 0, the log evidence is 


A log Z 



(26) 


« -55.1, 


with R = 30 and n — 25. 

Figure 21 illustrates the numerical estimates of AlogZ 
from a uniform-A ladder (with T max — oo) and from geomet¬ 
ric ladders with F max = 10 and T max = 10 4 . The evidence 
quadratures for the geometric ladders are augmented with 
a copy of E [log L]T max placed at T — oo as a crude measure 
to cover the integration domain. 

The evidence estimates recovered from these samplers 
are reported in Table 2 for 6 chains and 10, from which it 
is clear that selecting temperatures for uniform acceptance 
ratios can greatly increase the accuracy of the evidence esti¬ 
mate, bypassing the need to select a good initial temperature 
ladder. Note that the under- and over-estimates of A log Z 
from the geometric ladders in this case are a consequence of 
poor choices of T max rather than of sharp changes in E[log L]. 
While these comparisons are reasonable - since for a geomet¬ 
ric ladder it is very difficult to pick an appropriate T max in 
advance - we expect the presence of phase transitions to in¬ 
crease this disparity, and with it the advantages of adapting 
the ladder dynamically for uniform acceptance ratios. 


6.3 Other measures of optimality 

We have investigated the performance of a temperature lad¬ 
der adapted for uniform acceptance ratios in reducing the 
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Figure 20. The paths traced out between temperatures by the 12 samples in a single-walker run with LALInference on a BNS signal 
of SNR 15. Samples are identified by their colour. While swap proposals between chains 9 and 10 are freqently accepted, there is no 
migration of samples starting above chain 10 (solid lines) to chains below 9, and vice versa (dotted lines). 


Table 2. The evidence values of the truncated Gaussian of Sec¬ 
tion 4.1, estimated from a samplers of 6 and 10 temperatures 
allocated in three different ways, as compared to the analytical 
result. 


Temperature ladder 

A log Z 

N = 6 N =10 

Uniform-A: T max = oo 

-58.0 

-55.9 

Geometric: T max = 10 4 

-78.0 

-61.8 

Geometric: T max = 10 

-42.3 

-41.6 

Analytical result 

-55.1 


ACT of a parallel tempered MCMC sampler. The KL di¬ 
vergence discussed in Section 2.2 provides an alternative 
measure of the distance between two temperatures. In Sec¬ 
tion 4.1 we showed that uniform KL divergence in a tem¬ 
perature ladder does not correspond to uniform acceptance 
ratios beyond the special case of the ideal, unbounded Gaus¬ 
sian distribution described in Section 2.2. 

When applied to the truncated Gaussian discussed in 
Section 4.1, for which DKL(7rTj|7TT i+1 ) is analytically avail¬ 
able, the Dkl and A measured between chains drop off at 
different rates as T approaches T pr i 0 r (see Figure 5). In¬ 
deed, it is possible in principle to estimate the temperature- 
dependent normalising constants required to adapt on the 
KL divergence (Geyer 1994; Cameron & Pettitt 2014). It 
may be interesting to investigate such a scheme for resilience 
to the single-walker instability mentioned in Section 6.1. 

Meanwhile, Katzgraber et al. (2006) propose an opti¬ 
misation scheme in which temperatures are chosen to min¬ 
imise the round-trip time of a sample from T m i n to T max , 


which they suggest will improve sampling performance on 
systems with strong phase transitions. Their algorithm is 
tested on simulations of the two-dimensional Ising model, 
and is shown to select a different temperature configuration 
than the uniform-A scheme that has been discussed so far. 

However, the ACT of the sampler - what we are ulti¬ 
mately concerned with in efficient Bayesian inference - is 
not discussed, so it is unclear whether this strategy is better 
than selecting temperatures for uniform acceptance ratios. 
Their feedback optimisation method in fact prefers a higher 
density of chains per T across phase transitions of the sys¬ 
tem than the uniform-A scheme. We have shown, however, 
that the ACT yielded by a particular ladder is not critically 
sensitive to under-densities over phase transitions so long as 
the acceptance ratio is not prohibitively small in these tem¬ 
perature regimes (see Section 4.2.2). Indeed, increasing the 
density of chains over phase transitions too far might unnec¬ 
essarily hinder inter-chain communication at other temper¬ 
atures (by reducing A), leading to an overall rise in ACT. 

These reservations, together with the complicated book¬ 
keeping involved in optimising for round-trip time, lead us 
to favour the dynamical method presented in Section 3. By 
comparison, this dynamical method is simple and guaran¬ 
teed to produce an equilibrium ladder that yields efficient - 
if not perfectly optimal - sampling from any target distri¬ 
bution, with the proviso of many walkers per temperature. 
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P 


Figure 21. An illustration of the thermodynamic quadrature 
estimates of the log evidence of the truncated Gaussian discussed 
in Section 4.1. The shaded area shows the analytical mean logL 
as a function of /3, while the dashed lines illustrate numerical ap¬ 
proximations from the values of log L collected by samplers with 
different ladders, each of 6 temperatures. The resulting evidence 
estimates are reported in Table 2. Note the denser spacing of tem¬ 
peratures in the high-curvature region for the uniform-A ladder, 
and the errors incurred in extrapolating from T ma x to T = oo 
(/3 = 0) for the geometric ladders. 

and Figure 17 were generated with TRIANGLE.PY (Foreman- 
Mackey et al. 2014). This work was supported by the Science 
and Technology Facilities Council and the Leverhulme Trust 
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A reference implementation of the algorithm, based on 
the ensemble sampler emcee (Foreman-Mackey et al. 2013), 
is available at https://github.com/willvousden/ptemcee. 
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APPENDIX A: AUTOCORRELATION TIME 
ESTIMATION 

The autocorrelation time (ACT) discussed in this paper 
refers to the integrated autocorrelation time described by 
Sokal (1997). It is estimated in the following way. 

If x(t) is a time series with a normalised autocorrelation 
function p(t), such that p(0) = 1, then the integrated ACT 
of x is defined by 

oo 

T= Y p ^> 

t= — oo 

oo 

= i+ 2 y 

t =1 

Since, when t r, p{t) ~ 0, there is little contribu¬ 
tion to the integral at large lags, except through noise in 
the measured autocorrelation function p. We can therefore 
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approximate the ACT as 

Mr 

T PS 1+2^2 pit). 
t =1 

We estimate the ACT over a window that is M = 5 
ACTs long, subject to the constraint that Mr < N/ 2, where 
N is the number of samples in x. If this constraint is violated, 
the result is probably not trustworthy, since there are too 
few samples for a meaningful estimate. 
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